FigShare

A reexamination of information theory-based methods for DNA-binding site identification

Author: A Kolb
AR Fernandez De Henestrosa
B Barash
CE Lawrence
CE Shannon
D Betel
D GuhaThakurta
DT Pride
EN Trifonov
ET Jaynes
ET Jaynes
G Robertson
G Thijs
GD Stormo
GD Stormo
GD Stormo
GE Crooks
GJ Phillips
GZ Hertz
I Erill
Ivan Erill
J Rudnick
J van Helden
JJ Kohler
JM Heumann
JT Kim
JW Gibbs
K Gaston
K Uchida
KL Griffith
L Kozobay-Avraham
LJ Sun
LL Gatlin
LL Gatlin
M Abella
M Asayama
M Butala
M Schnarr
MC O'Neill
MC O'Neill
MC O'Neill
MH Zweig
Michael C O'Neill
ML Bulyk
MS Gelfand
N Baichoo
O Aparicio
O Huisman
OG Berg
OG Berg
P D'Haeseleer
PH von Hippel
PH von Hippel
R Brent
R Jauregui
R Munch
R Munch
R Osada
R Staden
RJ Redfield
RK Shultzaberger
RK Shultzaberger
RK Shultzaberger
RV Parbhane
S Krishna
S Kullback
ST Cole
TD Schneider
TD Schneider
TD Schneider
TD Schneider
TD Schneider
TL Bailey
TL Bailey
X Liu
Z Chen
Z Xiaoyue
Publication venue: BioMed Central
Publication date: 01/02/2009

Background: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.
Results: Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.
Conclusion: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
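The two measures contrasted in this abstract can be illustrated with a minimal sketch. It assumes a collection of equal-length aligned binding sites and uses the textbook definitions of information content (against a uniform background) and relative entropy (against a genome-specific background). The function names, the toy sites and the AT-rich background are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def column_frequencies(sites, alphabet="ACGT", pseudocount=0.0):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: (counts.get(b, 0) + pseudocount) / total for b in alphabet})
    return freqs

def information_content(freqs):
    """Sum over positions of log2(4) minus the Shannon entropy, in bits.

    This implicitly assumes a uniform background (25% per base)."""
    total = 0.0
    for col in freqs:
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        total += math.log2(len(col)) - entropy
    return total

def relative_entropy(freqs, background):
    """Sum over positions of the Kullback-Leibler divergence from the background, in bits."""
    total = 0.0
    for col in freqs:
        total += sum(p * math.log2(p / background[b]) for b, p in col.items() if p > 0)
    return total

# Toy data (made up for illustration): a fully conserved, A-only site
# scored against a hypothetical AT-rich genome.
toy_sites = ["AAAA", "AAAA", "AAAA"]
at_rich_background = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
toy_freqs = column_frequencies(toy_sites)
print(information_content(toy_freqs))
print(relative_entropy(toy_freqs, at_rich_background))
```

With a uniform background the two functions return the same value. With a skewed background, relative entropy scores a site that follows the genomic bias lower than information content does, which is the kind of skew correction whose assumptions the abstract questions.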

Is Thermosensing Property of RNA Thermometers Unique?

Author: A Avihoo
Alexander S. Spirin
B Voss
DH Mathews
DW Selinger
F Narberhaus
G Hambraeus
G Kudla
H Salgado
H Yuzawa
I Hofacker
JA Bernstein
K Nakahigashi
M Huynen
M Kozak
MH de Smit
Michael A. Gilchrist
MT Morita
Premal Shah
RK Shultzaberger
S Chowdhury
S Chowdhury
S Wuchty
T Hughes
T Waldminghaus
T Waldminghaus
Publication venue: Public Library of Science
Publication date: 01/01/2010

A large number of studies have been dedicated to identifying the structural and sequence-based features of RNA thermometers, mRNAs that regulate their translation initiation rate with temperature. It has been shown that the melting of the ribosome-binding site (RBS) plays a prominent role in this thermosensing process. However, little is known about how widespread this melting phenomenon is, as earlier studies on the subject have worked with a small sample of known RNA thermometers. We have developed a novel method of studying the melting of RNAs with temperature by computationally sampling the distribution of the RNA structures at various temperatures using the RNA folding software Vienna. In this study, we compared the thermosensing property of 100 randomly selected mRNAs and three well-known thermometers: the rpoH, ibpA and agsA sequences from E. coli. We also compared the rpoH sequences from 81 mesophilic proteobacteria. Although both rpoH and ibpA show a higher rate of melting at their RBS compared with the mean of non-thermometers, contrary to our expectations these higher rates are not significant. Surprisingly, we also do not find any significant differences between rpoH thermometers from other γ-proteobacteria and E. coli non-thermometers.

University of Tennessee, Knoxville: Trace

CiteSeerX

An iterative strategy combining biophysical criteria and duration hidden Markov models for structural predictions of Chlamydia trachomatis σ66 promoters

Author: A Agresti
A Kanhere
AA Afifi
AG Pedersen
AM Huerta
B Grech
BW Brunelle
C Bi
CB Harley
David H Ardell
David M Ojcius
DK Hawley
DW Hosmer
E Niehus
E Niehus
GEP Box
GZ Hertz
H Wang
J SantaLucia Jr
LM Iyer
M Tan
M Towsey
MC O'Neill
MG Munteanu
MG Reese
N Uljanov
PM Bavoil
PS Hefty
R Staden
R Wagner
RJ Belland
RK Shultzaberger
Ronna R Mallios
RR Picard
S Burden
S Lisser
S Mathews
SC Satchwell
SR Eddy
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Promoter identification is a first step in the quest to explain gene regulation in bacteria. It has been demonstrated that the initiation of bacterial transcription depends upon the stability and topology of DNA in the promoter region as well as the binding affinity between the RNA polymerase σ-factor and promoter. However, promoter prediction algorithms to date have not explicitly used an ensemble of these factors as predictors. In addition, most promoter models have been trained on data from Escherichia coli. Although it has been shown that transcriptional mechanisms are similar among various bacteria, it is quite possible that the differences between Escherichia coli and Chlamydia trachomatis are large enough to recommend an organism-specific modeling effort.
Results: Here we present an iterative stochastic model building procedure that combines such biophysical metrics as DNA stability, curvature, twist and stress-induced DNA duplex destabilization along with duration hidden Markov model parameters to model Chlamydia trachomatis σ66 promoters from 29 experimentally verified sequences. Initially, iterative duration hidden Markov modeling of the training set sequences provides a scoring algorithm for Chlamydia trachomatis RNA polymerase σ66/DNA binding. Subsequently, an iterative application of Stepwise Binary Logistic Regression selects multiple promoter predictors and deletes/replaces training set sequences to determine an optimal training set. The resulting model predicts the final training set with a high degree of accuracy and provides insights into the structure of the promoter region. Model-based genome-wide predictions are provided so that optimal promoter candidates can be experimentally evaluated, and refined models developed. Co-predictions with three other algorithms are also supplied to enhance reliability.
Conclusion: This strategy and the resulting model support the conjecture that DNA biophysical properties and RNA polymerase σ-factor/DNA binding collaboratively contribute to a sequence's ability to promote transcription. This work provides a baseline model that can evolve as new Chlamydia trachomatis σ66 promoters are identified with assistance from the provided genome-wide predictions. The proposed methodology is ideal for organisms with few identified promoters and relatively small genomes.

Pacific McGeorge School of Law

Scholarly Commons

Compensatory Evolution of Gene Regulation in Response to Stress by Escherichia coli Lacking RpoS

Author: A Farewell
A Mira
AM Dean
Charles J. Dorman
CL Burch
D Groth
Daniel M. Stoebel
David S. Guttman
DB Haniford
DJ Galas
ER Zinser
FB Moore
FC Neidhardt
G Becker
GK Smyth
GK Smyth
GK Smyth
H Weber
HA Orr
I Hautefort
J Bender
J Mellies
JE Karlinsey
JH Miller
JL Brissette
KA Datsenko
Karsten Hokamp
L Notley-McRobb
LN Csonka
M Feldgarden
M Yamagishi
Michael S. Last
MW Pfaffl
NA Moran
P Siguier
R Hengge-Aronis
RA Fisher
RA Irizarry
RC Massey
RE Lenski
RE Lenski
RE Lenski
RK Shultzaberger
RW Simons
S Zhong
SA Sawyer
SE Finkel
SF Elena
SJ Schrag
T Ferenci
T King
T King
T Nystrom
V Robbe-Saule
VS Cooper
W Klein
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2009

The RpoS sigma factor protein of Escherichia coli RNA polymerase is the master transcriptional regulator of physiological responses to a variety of stresses. This stress response comes at the expense of scavenging for scarce resources, causing a trade-off between stress tolerance and nutrient acquisition. This trade-off favors non-functional rpoS alleles in nutrient-poor environments. We used experimental evolution to explore how natural selection modifies the regulatory network of strains lacking RpoS when they evolve in an osmotically stressful environment. We found that strains lacking RpoS adapt less variably, in terms of both fitness increase and changes in patterns of transcription, than strains with functional RpoS. This phenotypic uniformity was caused by the same adaptive mutation in every independent population: the insertion of IS10 into the promoter of the otsBA operon. OtsA and OtsB are required to synthesize the osmoprotectant trehalose, and transcription of otsBA requires RpoS in the wild-type genetic background. The evolved IS10 insertion rewires expression of otsBA from RpoS-dependent to RpoS-independent, allowing for partial restoration of the wild-type response to osmotic stress. Our results show that the regulatory networks of bacteria can evolve new structures in ways that are both rapid and repeatable.

Scholarship@Claremont

CiteSeerX

Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes

Author: A Bolotin
A Henne
A Sola-Landa
AC Kaberdina
AL Delcher
B Chang
BE Moseley
CJ Wu
D Benelli
DE Andreev
E Torarinsson
FD Ciccarelli
Gang-Qing Hu
GE Crooks
GP van Wezel
GQ Hu
GQ Hu
GR Janssen
H Chen
H Nothaft
HJ Hong
HQ Zhu
HQ Zhu
Huaiqiu Zhu
I Moll
J Besemer
J Ma
J Shine
JA Lake
JS Hahn
K Chin
M Brenneis
M Jiang
M Kozak
M Ptashne
M Ventura
MA Larkin
MM Slupska
MN Price
MS Paget
N Tolstrup
NJ Ryding
O Hering
P Dam
P Londei
PA Hoskisson
R Hershberg
RK Shultzaberger
RL Tatusov
S Grill
S Kumar
S Nakagawa
T Sazuka
T Udagawa
T Umeyama
TB Anderson
V Mazurakova
WP Revill
Xiaobin Zheng
Zhen-Su She
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The Shine-Dalgarno (SD) signal has long been viewed as the dominant translation initiation signal in prokaryotes. Recently, leaderless genes, which lack 5'-untranslated regions (5'-UTR) on their mRNAs, have been shown to be abundant in archaea. However, current large-scale in silico analyses of initiation mechanisms in bacteria are mainly based on the SD-led mode of initiation rather than the leaderless one. The study of leaderless genes in bacteria remains open, which leaves our understanding of translation initiation mechanisms in prokaryotes uncertain.
Results: Here, we study signals in the translation initiation regions of all genes in 953 bacterial and 72 archaeal genomes, then attempt to construct an evolutionary scenario for leaderless genes in bacteria. With an algorithm designed to identify multiple signals in the upstream regions of the genes in a genome, we classify all genes into SD-led, TA-led and atypical genes according to the category of the most probable signal in their upstream sequences. In particular, the occurrence of TA-like signals about 10 bp upstream of the translation initiation site (TIS) in bacteria most probably indicates leaderless genes.
Conclusions: Our analysis reveals that leaderless genes are widespread, although not dominant, in a variety of bacteria. In Actinobacteria and Deinococcus-Thermus in particular, more than twenty percent of genes are leaderless. In closely related bacterial genomes, our results imply that the change of translation initiation mechanisms, which happens between genes deriving from a common ancestor, is linearly dependent on the phylogenetic relationship. Analysis of the macroevolution of leaderless genes further shows that the proportion of leaderless genes in bacteria has a decreasing trend in evolution.

Design Parameters to Control Synthetic Gene Expression in Escherichia coli

Author: A Bjornsson
A Eyre-Walker
A Fuglsang
A Henaut
A Villalobos
Alan Villalobos
Austin Gurney
BJ Del Tito Jr.
C Gustafsson
Claes Gustafsson
CM Stenstrom
CM Stenstrom
CM Stenstrom
DH Mathews
E Gonzalez de Valdivia
EI Gonzalez de Valdivia
G Kudla
G Wu
G Wu
GA Gutman
Grzegorz Kudla
GT Chen
H Dong
H Dong
I Iost
J Bonomo
J Elf
J Liao
J Newcomb
JC Venter
Jeremy Minshull
JF Kane
JH Holland
Jon E. Ness
K Itakura
KA Dittmar
L Blanco
L Eriksson
M Graf
M Welch
MA Sørensen
Mark Welch
MV Rojiani
NA Burgess-Brown
P Rice
PJ Dillon
PM Sharp
R Reynolds
RK Shultzaberger
S Boycheva
S Wold
Sridhar Govindarajan
SW Harcum
VR Kaberdin
VR Kaberdin
Y Sohn
Publication venue: Public Library of Science
Publication date: 01/09/2009

BACKGROUND: Production of proteins as therapeutic agents, research reagents and molecular tools frequently depends on expression in heterologous hosts. Synthetic genes are increasingly used for protein production because sequence information is easier to obtain than the corresponding physical DNA. Protein-coding sequences are commonly re-designed to enhance expression, but there are no experimentally supported design principles.
PRINCIPAL FINDINGS: To identify sequence features that affect protein expression we synthesized and expressed in E. coli two sets of 40 genes encoding two commercially valuable proteins, a DNA polymerase and a single chain antibody. Genes differing only in synonymous codon usage expressed protein at levels ranging from undetectable to 30% of cellular protein. Using partial least squares regression we tested the correlation of protein production levels with parameters that have been reported to affect expression. We found that the amount of protein produced in E. coli was strongly dependent on the codons used to encode a subset of amino acids. Favorable codons were predominantly those read by tRNAs that are most highly charged during amino acid starvation, not codons that are most abundant in highly expressed E. coli proteins. Finally we confirmed the validity of our models by designing, synthesizing and testing new genes using codon biases predicted to perform well.
CONCLUSION: The systematic analysis of gene design parameters shown in this study has allowed us to identify codon usage within a gene as a critical determinant of achievable protein expression levels in E. coli. We propose a biochemical basis for this, as well as design algorithms to ensure high protein production from synthetic genes. Replication of this methodology should allow similar design algorithms to be empirically derived for any expression system.

Genome-Wide Identification of Transcription Start Sites, Promoters and Transcription Factor Binding Sites in E. coli

Despite almost 40 years of molecular genetics research in Escherichia coli, a major fraction of its Transcription Start Sites (TSSs) are still unknown, which limits our understanding of the regulatory circuits that control gene expression in this model organism. RegulonDB (http://regulondb.ccg.unam.mx/) aims to integrate the genetic regulatory network of E. coli K12 and has until now been an entirely bioinformatic project. In this work, we extended its aims by generating experimental data at a genome scale on TSSs, promoters and regulatory regions. We implemented a modified 5′ RACE protocol and an unbiased High Throughput Pyrosequencing Strategy (HTPS) that allowed us to map more than 1700 TSSs with high precision. From this collection, about 230 corresponded to previously reported TSSs, which helped us to benchmark both our methodologies and the accuracy of the previous mapping experiments. The other approximately 1500 TSSs mapped belong to about 1000 different genes, many of them with no assigned function. We identified promoter sequences and the type of σ factors that control the expression of about 80% of these genes. As expected, the housekeeping σ70 was the most common type of promoter, followed by σ38. The majority of the putative TSSs were located between 20 and 40 nucleotides from the translational start site. Putative regulatory binding sites for transcription factors were detected upstream of many TSSs. For a few transcripts, riboswitches and small RNAs were found. Several genes also had additional TSSs within the coding region. Unexpectedly, the HTPS experiments revealed extensive antisense transcription, probably for regulatory functions.
The new information in RegulonDB, now with more than 2400 experimentally determined TSSs, strengthens the accuracy of promoter prediction, operon structure, and regulatory networks and provides valuable new information that will facilitate a global understanding of the complex and intricate regulatory network that operates in E. coli.

Digital.CSIC

Identification and functional characterization of small non-coding RNAs in Xanthomonas oryzae pathovar oryzae

Background: Small non-coding RNAs (sRNAs) are regarded as important regulators in prokaryotes and play essential roles in diverse cellular processes. Xanthomonas oryzae pathovar oryzae (Xoo) is an important plant pathogenic bacterium which causes serious bacterial blight of rice. However, little is known about the number, genomic distribution and biological functions of sRNAs in Xoo.
Results: Here, we performed a systematic screen to identify sRNAs in the Xoo strain PXO99. A total of 850 putative non-coding RNA sequences originating from intergenic and gene antisense regions were identified by cloning, of which 63 were also identified as sRNA candidates by computational prediction and were therefore considered Xoo sRNA candidates. Northern blot hybridization confirmed the size and expression of 6 sRNA candidates and 2 other cloned small RNA sequences, which were then added to the sRNA candidate list. We further examined the expression profiles of the eight sRNAs in an hfq deletion mutant and found that two of them showed drastically decreased expression levels, and another exhibited an Hfq-dependent transcript processing pattern. Deletion mutants were obtained for seven of the Northern-confirmed sRNAs, but none of them exhibited obvious phenotypes. Comparison of the proteomic differences between three of the ΔsRNA mutants and the wild-type strain by two-dimensional gel electrophoresis (2-DE) analysis showed that these sRNAs are involved in multiple physiological and biochemical processes.
Conclusions: We experimentally verified eight sRNAs in a genome-wide screen and uncovered three Hfq-dependent sRNAs in Xoo. Proteomics analysis revealed that Xoo sRNAs may take part in various metabolic processes. Taken together, this work represents the first comprehensive screen and functional analysis of sRNAs in rice pathogenic bacteria and facilitates future studies on sRNA-mediated regulatory networks in this important phytopathogen.

Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars

Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method for formally relating a set of parts to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program's source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. These capabilities are illustrated by simple example grammars expressing how gene expression rates depend upon single or multiple parts. The translation process is validated by systematically generating, translating, and simulating the phenotype of all the sequences in the design space generated by a small library of genetic parts. Attribute grammars represent a flexible framework connecting parts with models of biological function. They will be instrumental for building mathematical models of libraries of genetic constructs synthesized to characterize the function of genetic parts. This formalism is also expected to provide a solid foundation for the development of computer-assisted design applications for synthetic biology.
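The idea of attaching attributes to parts and combining them through structural rules can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the grammar from the paper. The single rule (cassette -> promoter rbs gene terminator), the attribute names, the product used as the expression rate and the part names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Part:
    name: str
    kind: str           # "promoter", "rbs", "gene" or "terminator"
    value: float = 0.0  # hypothetical attribute: a rate carried by promoters and RBSs

# Hypothetical structural rule: cassette -> promoter rbs gene terminator
CASSETTE_RULE = ["promoter", "rbs", "gene", "terminator"]

def translate(parts):
    """Translate a flat list of parts into one model entry per cassette.

    Each entry carries a synthesized attribute, expression_rate, computed
    from the attributes of the promoter and RBS in the same cassette."""
    size = len(CASSETTE_RULE)
    if len(parts) == 0 or len(parts) % size != 0:
        raise ValueError("design is not a whole number of cassettes")
    model = []
    for start in range(0, len(parts), size):
        group = parts[start:start + size]
        if [p.kind for p in group] != CASSETTE_RULE:
            raise ValueError(f"parts {start}..{start + size - 1} do not form a cassette")
        promoter, rbs, gene, _terminator = group
        model.append({
            "protein": gene.name,
            "transcription_rate": promoter.value,
            "translation_rate": rbs.value,
            # Assumed combination rule, chosen only to show attribute synthesis.
            "expression_rate": promoter.value * rbs.value,
        })
    return model

# Made-up design built from a small hypothetical parts library.
design = [
    Part("pA", "promoter", 0.5),
    Part("rbs1", "rbs", 2.0),
    Part("gfp", "gene"),
    Part("t1", "terminator"),
]
print(translate(design))
```

Enumerating every ordering of a small parts library and passing each through translate would mimic, in miniature, the systematic design-space exploration the abstract describes.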